Universities and Gentrification

UMD CMSC320 Data Science, Spring 2023

Joe Diaz and Connor Pymm


🙏RUN ME FIRST🙏

In [1]:
%pip install plotly
%pip install "jupyterlab>=3" "ipywidgets>=7.6"
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'notebook'

import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

Data Curation


Selecting Datasets

In order to perform analysis on colleges and their surrounding regions, we needed three things: a subset of colleges to study, a dataset of yearly characteristics for those colleges, and a dataset of yearly characteristics for their nearby geographical areas.

Initially, we decided to limit our analysis to the top 100 universities in the country according to current US News rankings, under the assumption that more highly ranked universities might have a more significant impact on their respective communities. We used Andy Reiter's “U.S. News & World Report Historical Liberal Arts College and University Rankings” dataset (citation).

In order to obtain college characteristics, we discovered that the Department of Education maintains extensive data on accredited universities, called the College Scorecard, which has a public API for programmatically querying data.

In order to obtain characteristics of the region around each university, we needed a dataset containing demographic and economic data for defined geographical regions associated with each university's location. We found that the Census's yearly American Community Survey data had the housing cost and income data we wanted to analyze, and that the Census's Public Use Microdata API allowed us to programmatically request that data for geographical groupings called "Public Use Microdata Areas," the smallest geographical entities for which the Census collects yearly data.
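As an illustration of what such a request looks like, an ACS 1-year query at the PUMA level can be assembled as a single URL. The endpoint layout and variable code below (e.g. `B19013_001E`, median household income) follow public Census API conventions and are illustrative assumptions, not the exact variables we queried:

```python
from urllib.parse import urlencode

def build_acs_puma_url(year, variables, state_fips, api_key="YOUR_KEY"):
    """Sketch: build an ACS 1-year API query for all PUMAs in one state."""
    base = f"https://api.census.gov/data/{year}/acs/acs1"
    params = {
        "get": ",".join(variables),                # variable codes to fetch
        "for": "public use microdata area:*",      # one row per PUMA
        "in": f"state:{state_fips}",               # restrict to a state FIPS
        "key": api_key,
    }
    return f"{base}?{urlencode(params)}"

url = build_acs_puma_url(2019, ["NAME", "B19013_001E"], "06")
print(url)
```

Fetching the URL with `requests.get(url).json()` (given a real key) returns a list-of-lists payload that converts directly into a DataFrame.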

Extract, Transform, and Load

Since we queried a substantial amount of data from very large datasets, and requesting federal data from the Department of Education and the Census requires registering for and using API keys, we decided that, in addition to the source datasets we could download in full (stored in our repository under ETL/source_data), we would create modules that make the federal API requests and load the results into CSV files for later use.

Dataframes generated from queried data were stored under ETL/generated_data as CSV files, then loaded into the notebook when needed. Specifically: we built ScorecardData.csv using our scorecard_client.py module, which defines a CollegeScorecardClient object that can query DoE data given a valid API key, a set of desired variables, and a set of colleges identified by IPEDS IDs; and we built college_FIPs by combining the university list we got from Reiter with state FIPS data from the DoE and county FIPS codes collected manually, university by university.

For the rest of this tutorial, we will be using the data we collected by default, but if you would like to recreate the analysis of this tutorial using a different set of colleges, and thus your own datasets, you can fork this repository and use the modules provided in the ETL/ directory to do so.
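For readers forking the repository, the shape of a Scorecard request is worth sketching. The following is a simplified illustration in the spirit of our scorecard_client.py module, not its actual code; the endpoint and parameter names follow the public College Scorecard API, while the specific fields are assumptions chosen for the example:

```python
from urllib.parse import urlencode

SCORECARD_BASE = "https://api.data.gov/ed/collegescorecard/v1/schools"

def build_scorecard_query(ipeds_ids, fields, api_key="YOUR_KEY"):
    """Sketch: build a Scorecard query for a set of IPEDS unit IDs."""
    params = {
        "id": ",".join(str(i) for i in ipeds_ids),  # filter to these schools
        "fields": ",".join(fields),                 # restrict returned variables
        "api_key": api_key,
    }
    return f"{SCORECARD_BASE}?{urlencode(params)}"

url = build_scorecard_query([131159, 100858], ["school.name", "2019.student.size"])
print(url)
```

The JSON response's `results` list can then be flattened into rows of a DataFrame, which is essentially what ScorecardData.csv contains.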


Data Processing


Loading and Representation

Here, we load the data we downloaded or generated locally into the notebook for use in our analysis. We stored each of our datasets as CSV files, so they are easily loaded into pandas DataFrames. We also need to define a few terms so that the reader can follow the discussion:

  1. MSA: A Metropolitan Statistical Area is a Census-designated region (there are 384 in the U.S.) built around an urban core of at least 50,000 people. We use MSAs as a standard for the general region around each college, and as a sort of control variable against our more limited and specific PUMAs.
  2. PUMA: PUMA stands for "Public Use Microdata Area," a Census geography that sits somewhere between a county and a school system in size. It was the smallest readily available geography in the Census dataset that we could map to using ZIP codes and tracts, so we use it as the specific locality that we compare against our control (the MSA). If our hypothesis is correct, PUMAs should be more volatile, since rising college costs should drive up prices in these localities.
  3. Tract: A Census Tract is a specific section of land, even more fine-grained than a PUMA, that we use to connect the ZIP codes of the top 100 universities to the Census's MSAs and PUMAs. Tracts are finicky and hard to work with, but we found a U.S. government API that converts between ZIP Codes, Census Tracts, and MSAs, which allowed us to complete our analysis.
  4. Zip Code: You likely already know what a ZIP code is, but working with ZIP codes was one of the harder challenges in this project: they are maintained by the U.S. Postal Service, not the Census, so the two agencies do not share a standard geography, hence the conversion chain Zip -> Tract -> MSA and PUMA.
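One wrinkle the ZIP chain has to handle: the Scorecard data stores many ZIPs in ZIP+4 form (e.g. "20016-8001"), and ZIPs read as numbers can lose leading zeros. A minimal normalization sketch (the helper name is ours, for illustration):

```python
def normalize_zip(zip_code):
    """Reduce a ZIP or ZIP+4 value to a canonical 5-digit ZIP string."""
    five = str(zip_code).split("-")[0].strip()  # drop the +4 suffix if present
    return five.zfill(5)                        # restore lost leading zeros

print(normalize_zip("20016-8001"))  # "20016"
print(normalize_zip(6520))          # "06520", a New England ZIP parsed as int
```

Running every campus ZIP through a step like this before any crosswalk lookup avoids silent misses on ZIP+4 and leading-zero ZIPs.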
In [2]:
# Read dataframes from Scorecard generated data
scard_df = pd.read_csv("ETL/generated_data/ScorecardData.csv")
fips_df = pd.read_csv("ETL/generated_data/college_FIPs.csv")
cpi_df = pd.read_csv("ETL/source_data/cpi_all.csv").groupby("Year")["Value"].mean()
scard_df.head()
Out[2]:
student.size cost.tuition.in_state cost.tuition.out_of_state cost.avg_net_price.public cost.avg_net_price.private id school.name school.carnegie_size_setting school.zip school.region_id school.state_fips school.locale school.ownership year
0 5042.0 NaN NaN NaN NaN 131159 American University 17 20016-8001 2 11 11 2 1996
1 18337.0 NaN NaN NaN NaN 100858 Auburn University 15 36849 5 1 13 1 1996
2 10395.0 NaN NaN NaN NaN 223232 Baylor University 16 76798 6 48 12 2 1996
3 9203.0 NaN NaN NaN NaN 196079 Binghamton University 16 13850-6000 2 36 22 1 1996
4 10120.0 NaN NaN NaN NaN 164924 Boston College 17 02467 1 25 13 2 1996
In [3]:
years = range(2009, 2020)
msa_path_format = "ETL/generated_data/MSA{y}.csv"
MSA_frames = [
    pd.read_csv(msa_path_format.format(y=yr)).assign(year=yr)
    for yr in years
]
msa_df = pd.concat(MSA_frames)
msa_df.columns = ["name", "msa_income", "msa_rent", "msa", "year"]
print(msa_df["name"].unique().shape)
msa_df
(36,)
Out[3]:
name msa_income msa_rent msa year
0 CO Metro Area" 59007 1207 19740] 2009
1 NC Metro Area" 49902 921 20500] 2009
2 FL Metro Area" 37654 881 23540] 2009
3 NC Metro Area" 41272 789 24660] 2009
4 SC Metro Area" 43283 763 24860] 2009
... ... ... ... ... ...
56 CA Metro Area" 77774 1774 31080] 2019
57 WI Metro Area" 75545 1218 31540] 2019
58 DC-VA-MD-WV Metro Area" 105659 1862 47900] 2019
59 NC Metro Area" 52322 815 49180] 2019
60 MA-CT Metro Area" 76348 1319 49340]] 2019

657 rows × 5 columns

In [4]:
years = range(2009, 2020)
puma_path_format = "ETL/generated_data/PUMA{y}.csv"
PUMA_frames = [
    pd.read_csv(puma_path_format.format(y=yr)).assign(year=yr)
    for yr in years
]
puma_df = pd.concat(PUMA_frames).reset_index(drop=True)
puma_df.columns = ["puma_income", "puma_rent", "state", "puma", "year"]
puma_df["state"] = puma_df["state"].astype("int")
puma_df["puma"] = puma_df["puma"].astype("int")
puma_df["year"] = puma_df["year"].astype("int")

print(puma_df["puma"].unique().shape)
puma_df
(84,)
Out[4]:
puma_income puma_rent state puma year
0 52023 1060 6 1800 2009
1 36740 1020 6 1901 2009
2 69677 1710 6 2001 2009
3 49350 1244 6 2101 2009
4 70522 1561 6 2201 2009
... ... ... ... ... ...
3414 101524 1921 51 1302 2019
3415 51987 751 55 100 2019
3416 57284 1254 55 101 2019
3417 61443 767 55 1301 2019
3418 65851 826 55 1600 2019

3419 rows × 5 columns

Data Cleaning and Reshaping

The data still uses the variable names and formatting of our original sources, and those names are unwieldy and not ideal for use in later analysis, so we rename our columns to be more human-readable and developer-friendly. Additionally, the cost data in our sources does not account for inflation, so we use an all-consumers/all-goods CPI to convert our dollar values to constant dollars.
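The adjustment applied below divides each nominal dollar figure by that year's average CPI and multiplies by 100. A quick worked example with a made-up CPI value:

```python
def cpi_adjust(nominal, cpi):
    """Convert a nominal dollar amount to constant dollars: nominal / CPI * 100."""
    return (nominal / cpi) * 100

# e.g. a $40,345 net cost in a year whose average CPI was 214.5 (illustrative)
print(round(cpi_adjust(40345, 214.5), 2))
```

A higher CPI (later, more inflated year) shrinks the constant-dollar figure, so costs across years become directly comparable.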

In [5]:
# Rename columns to be more readable, usable
scard_df = scard_df.rename(
    columns={
        "student.size": "size",
        "cost.tuition.in_state": "in_state_tuition",
        "cost.tuition.out_of_state": "out_state_tuition",
        "cost.avg_net_price.public": "public_net_price",
        "cost.avg_net_price.private": "private_net_price",
        "school.name": "name",
        "school.carnegie_size_setting": "size_setting",
        "school.zip": "zip",
        "school.state_fips": "state_fips",
        "school.region_id": "region_id",
        "school.locale": "locale",
        "school.ownership": "ownership"
    }
)

# Join county FIPs codes into College Scorecard dataframe for use later in
# associating with Census geographies.
scard_df = pd.merge(
    scard_df, 
    fips_df[["id", "county", "cbsa", "puma"]], 
    on="id", how="left").drop_duplicates()
print(scard_df)

# Combine public and private net prices into a single net price column, and drop those columns
scard_df["net_cost"] = scard_df.apply(lambda row: 
            row["public_net_price"] if (row["ownership"] == 1) else row["private_net_price"],
        axis=1
)
scard_df["net_cost_adjusted"] = scard_df.apply(lambda row: 
            (row["net_cost"]/cpi_df.at[row["year"]]) * 100,
        axis=1
)
scard_df["in_tuition_adjusted"] = scard_df.apply(lambda row: 
            (row["in_state_tuition"]/cpi_df.at[row["year"]]) * 100,
        axis=1
)
scard_df["out_tuition_adjusted"] = scard_df.apply(lambda row: 
            (row["out_state_tuition"]/cpi_df.at[row["year"]]) * 100,
        axis=1
)
scard_df["non_tuition_expenses_adj"] = scard_df.apply(lambda row: 
            row["net_cost_adjusted"] - row["in_tuition_adjusted"],
        axis=1
)



scard_df["state_fips"] = scard_df["state_fips"].astype(int)
scard_df["year"] = scard_df["year"].astype(int)
scard_df.drop(["public_net_price", "private_net_price"], axis=1, inplace=True)
scard_df.head()
         size  in_state_tuition  out_state_tuition  public_net_price  \
0      5042.0               NaN                NaN               NaN   
1     18337.0               NaN                NaN               NaN   
2     18337.0               NaN                NaN               NaN   
3     10395.0               NaN                NaN               NaN   
4      9203.0               NaN                NaN               NaN   
...       ...               ...                ...               ...   
3820   7366.0           57386.0            57386.0               NaN   
3821   6217.0           23628.0            46854.0           19593.0   
3822   4804.0           54416.0            54416.0               NaN   
3823   4701.0           57700.0            57700.0               NaN   
3824   2619.0           46475.0            46475.0               NaN   

      private_net_price      id                               name  \
0                   NaN  131159                American University   
1                   NaN  100858                  Auburn University   
2                   NaN  100858                  Auburn University   
3                   NaN  223232                  Baylor University   
4                   NaN  196079              Binghamton University   
...                 ...     ...                                ...   
3820            26921.0  179867  Washington University in St Louis   
3821                NaN  231624                     William & Mary   
3822            42835.0  168421    Worcester Polytechnic Institute   
3823            15296.0  130794                    Yale University   
3824            39536.0  197708                 Yeshiva University   

      size_setting         zip  region_id  state_fips  locale  ownership  \
0               17  20016-8001          2          11      11          2   
1               15       36849          5           1      13          1   
2               15       36849          5           1      13          1   
3               16       76798          6          48      12          2   
4               16  13850-6000          2          36      22          1   
...            ...         ...        ...         ...     ...        ...   
3820            17  63130-4899          4          29      21          2   
3821            14  23187-8795          5          51      23          1   
3822            13  01609-2280          1          25      12          2   
3823            17       06520          1           9      12          2   
3824            14  10033-3299          2          36      11          2   

      year  county     cbsa     puma  
0     1996     1.0  47900.0    101.0  
1     1996    81.0  12220.0   2101.0  
2     1996    81.0  10760.0   2101.0  
3     1996   309.0  47380.0   3801.0  
4     1996     7.0  13780.0   2201.0  
...    ...     ...      ...      ...  
3820  2020   510.0  41180.0   2001.0  
3821  2020   830.0  47260.0   9500.0  
3822  2020    27.0  49340.0    505.0  
3823  2020     9.0  35300.0  20601.0  
3824  2020    61.0  35620.0   4112.0  

[3825 rows x 17 columns]
Out[5]:
size in_state_tuition out_state_tuition id name size_setting zip region_id state_fips locale ownership year county cbsa puma net_cost net_cost_adjusted in_tuition_adjusted out_tuition_adjusted non_tuition_expenses_adj
0 5042.0 NaN NaN 131159 American University 17 20016-8001 2 11 11 2 1996 1.0 47900.0 101.0 NaN NaN NaN NaN NaN
1 18337.0 NaN NaN 100858 Auburn University 15 36849 5 1 13 1 1996 81.0 12220.0 2101.0 NaN NaN NaN NaN NaN
2 18337.0 NaN NaN 100858 Auburn University 15 36849 5 1 13 1 1996 81.0 10760.0 2101.0 NaN NaN NaN NaN NaN
3 10395.0 NaN NaN 223232 Baylor University 16 76798 6 48 12 2 1996 309.0 47380.0 3801.0 NaN NaN NaN NaN NaN
4 9203.0 NaN NaN 196079 Binghamton University 16 13850-6000 2 36 22 1 1996 7.0 13780.0 2201.0 NaN NaN NaN NaN NaN
In [6]:
locale_map = {
    11:	"City: Large",
    12:	"City: Midsize",
    13:	"City: Small",
    21:	"Suburb: Large",
    22:	"Suburb: Midsize",
    23:	"Suburb: Small",
    31:	"Town: Fringe",
    32:	"Town: Distant",
    33:	"Town: Remote",
    41:	"Rural: Fringe",
    42:	"Rural: Distant",
    43:	"Rural: Remote"
}
ownership_map = {
    1:	"Public",
    2:	"Private nonprofit",
    3:	"Private for-profit"
}
region_map = {
    0:	"U.S. Service Schools",
    1:	"New England",
    2:	"Mid East",
    3:	"Great Lakes",
    4:	"Plains",
    5:	"Southeast",
    6:	"Southwest",
    7:	"Rocky Mountains",
    8:	"Far West",
    9:	"Outlying Areas"
}

# Map coded values to human-readable labels
scard_df["region"] = scard_df["region_id"].map(region_map)
scard_df["ownership"] = scard_df["ownership"].map(ownership_map)
scard_df["locale"] = scard_df["locale"].map(locale_map)

We can note that some rows have no cost data associated with them. Since we will use this cost data later in our analysis, we need to either interpolate the missing values or drop the invalid rows. Here, we experiment with dropping rows with missing data.
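For reference, the interpolation alternative would fill gaps within each school's own time series rather than discarding rows. A hedged sketch on toy data (this is not what we do below; we drop rows instead):

```python
import pandas as pd

# Toy data: two schools, one missing cost value each
demo = pd.DataFrame({
    "id":       [1, 1, 1, 2, 2, 2],
    "year":     [2009, 2010, 2011, 2009, 2010, 2011],
    "net_cost": [100.0, None, 300.0, 50.0, 60.0, None],
})
demo = demo.sort_values(["id", "year"])
# Linearly interpolate within each school; limit_direction="both" also
# fills leading/trailing gaps with the nearest valid value.
demo["net_cost"] = (
    demo.groupby("id")["net_cost"]
        .transform(lambda s: s.interpolate(limit_direction="both"))
)
print(demo["net_cost"].tolist())
```

We chose dropping over interpolation because, as shown next, the missing cost data turns out to be concentrated entirely before 2009, so interpolation would fabricate a decade of values rather than bridge small gaps.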

In [7]:
scard_clipped = scard_df.dropna(subset=["net_cost", "in_state_tuition", "out_state_tuition"]).copy()
#print(scard_clipped["name"].unique().shape)
#print(scard_clipped.to_string())

It appears that the clipped dataframe, after dropping null cost data, is simply the data from 2009 onward. To verify this, we query the original dataset by restricting the years alone: if cost data is complete from 2009 through 2020, the resulting dataframe should equal the one produced by dropping nulls. Run the next code cell to confirm this.

In [8]:
scard_clipped_year = scard_df[scard_df["year"] >= 2009].copy()
scard_clipped_year.equals(scard_clipped)
Out[8]:
True
In [9]:
# Strip stray "]" characters left over from parsing, treating it as a literal
msa_df["msa"] = msa_df["msa"].str.replace("]", "", regex=False).astype(int)
msa_df
Out[9]:
name msa_income msa_rent msa year
0 CO Metro Area" 59007 1207 19740 2009
1 NC Metro Area" 49902 921 20500 2009
2 FL Metro Area" 37654 881 23540 2009
3 NC Metro Area" 41272 789 24660 2009
4 SC Metro Area" 43283 763 24860 2009
... ... ... ... ... ...
56 CA Metro Area" 77774 1774 31080 2019
57 WI Metro Area" 75545 1218 31540 2019
58 DC-VA-MD-WV Metro Area" 105659 1862 47900 2019
59 NC Metro Area" 52322 815 49180 2019
60 MA-CT Metro Area" 76348 1319 49340 2019

657 rows × 5 columns

In [10]:
scard_clipped_geo = scard_clipped.dropna(subset=["cbsa", "puma"]).copy()
scard_clipped_geo
Out[10]:
size in_state_tuition out_state_tuition id name size_setting zip region_id state_fips locale ... year county cbsa puma net_cost net_cost_adjusted in_tuition_adjusted out_tuition_adjusted non_tuition_expenses_adj region
1989 6430.0 34973.0 34973.0 131159 American University 17 20016-8001 2 11 City: Large ... 2009 1.0 47900.0 101.0 40345.0 18803.189093 16299.514987 16299.514987 2503.674106 Mid East
1990 19918.0 6972.0 19452.0 100858 Auburn University 15 36849 5 1 City: Small ... 2009 81.0 12220.0 2101.0 13225.0 6163.642973 3249.370042 9065.798345 2914.272931 Southeast
1991 19918.0 6972.0 19452.0 100858 Auburn University 15 36849 5 1 City: Small ... 2009 81.0 10760.0 2101.0 13225.0 6163.642973 3249.370042 9065.798345 2914.272931 Southeast
1992 12101.0 28070.0 28070.0 223232 Baylor University 16 76798 6 48 City: Midsize ... 2009 309.0 47380.0 3801.0 24174.0 11266.533477 13082.303082 13082.303082 -1815.769605 Southwest
1993 11635.0 6761.0 14661.0 196079 Binghamton University 16 13850-6000 2 36 Suburb: Midsize ... 2009 7.0 13780.0 2201.0 13999.0 6524.373382 3151.031391 6832.905076 3373.341992 Mid East
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3820 7366.0 57386.0 57386.0 179867 Washington University in St Louis 17 63130-4899 4 29 Suburb: Large ... 2020 510.0 41180.0 2001.0 26921.0 10400.208357 22169.546331 22169.546331 -11769.337974 Plains
3821 6217.0 23628.0 46854.0 231624 William & Mary 14 23187-8795 5 51 Suburb: Small ... 2020 830.0 47260.0 9500.0 19593.0 7569.231542 9128.045877 18100.789806 -1558.814335 Southeast
3822 4804.0 54416.0 54416.0 168421 Worcester Polytechnic Institute 13 01609-2280 1 25 City: Midsize ... 2020 27.0 49340.0 505.0 42835.0 16548.156642 21022.166263 21022.166263 -4474.009620 New England
3823 4701.0 57700.0 57700.0 130794 Yale University 17 06520 1 9 City: Midsize ... 2020 9.0 35300.0 20601.0 15296.0 5909.200514 22290.851833 22290.851833 -16381.651319 New England
3824 2619.0 46475.0 46475.0 197708 Yeshiva University 14 10033-3299 2 36 City: Large ... 2020 61.0 35620.0 4112.0 39536.0 15273.676223 17954.373292 17954.373292 -2680.697069 Mid East

1656 rows × 21 columns

In [11]:
scard_clipped_geo["puma"] = scard_clipped_geo["puma"].astype(int)
scard_clipped_geo["cbsa"] = scard_clipped_geo["cbsa"].astype(int)
print(scard_clipped_geo["puma"].unique().shape)
scard_clipped_geo
(111,)
Out[11]:
size in_state_tuition out_state_tuition id name size_setting zip region_id state_fips locale ... year county cbsa puma net_cost net_cost_adjusted in_tuition_adjusted out_tuition_adjusted non_tuition_expenses_adj region
1989 6430.0 34973.0 34973.0 131159 American University 17 20016-8001 2 11 City: Large ... 2009 1.0 47900 101 40345.0 18803.189093 16299.514987 16299.514987 2503.674106 Mid East
1990 19918.0 6972.0 19452.0 100858 Auburn University 15 36849 5 1 City: Small ... 2009 81.0 12220 2101 13225.0 6163.642973 3249.370042 9065.798345 2914.272931 Southeast
1991 19918.0 6972.0 19452.0 100858 Auburn University 15 36849 5 1 City: Small ... 2009 81.0 10760 2101 13225.0 6163.642973 3249.370042 9065.798345 2914.272931 Southeast
1992 12101.0 28070.0 28070.0 223232 Baylor University 16 76798 6 48 City: Midsize ... 2009 309.0 47380 3801 24174.0 11266.533477 13082.303082 13082.303082 -1815.769605 Southwest
1993 11635.0 6761.0 14661.0 196079 Binghamton University 16 13850-6000 2 36 Suburb: Midsize ... 2009 7.0 13780 2201 13999.0 6524.373382 3151.031391 6832.905076 3373.341992 Mid East
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3820 7366.0 57386.0 57386.0 179867 Washington University in St Louis 17 63130-4899 4 29 Suburb: Large ... 2020 510.0 41180 2001 26921.0 10400.208357 22169.546331 22169.546331 -11769.337974 Plains
3821 6217.0 23628.0 46854.0 231624 William & Mary 14 23187-8795 5 51 Suburb: Small ... 2020 830.0 47260 9500 19593.0 7569.231542 9128.045877 18100.789806 -1558.814335 Southeast
3822 4804.0 54416.0 54416.0 168421 Worcester Polytechnic Institute 13 01609-2280 1 25 City: Midsize ... 2020 27.0 49340 505 42835.0 16548.156642 21022.166263 21022.166263 -4474.009620 New England
3823 4701.0 57700.0 57700.0 130794 Yale University 17 06520 1 9 City: Midsize ... 2020 9.0 35300 20601 15296.0 5909.200514 22290.851833 22290.851833 -16381.651319 New England
3824 2619.0 46475.0 46475.0 197708 Yeshiva University 14 10033-3299 2 36 City: Large ... 2020 61.0 35620 4112 39536.0 15273.676223 17954.373292 17954.373292 -2680.697069 Mid East

1656 rows × 21 columns

In [12]:
scard_df = scard_clipped_geo
scard_df = pd.merge(scard_df, puma_df, left_on=["puma", "year"], right_on=["puma", "year"], how="inner")
scard_df["name"].unique().shape
Out[12]:
(70,)
In [13]:
scard_df = pd.merge(scard_df, msa_df, left_on=["cbsa", "year"], right_on=["msa", "year"], how="left")
scard_df["name_y"].unique().shape
Out[13]:
(33,)

Exploratory Analysis and Data Visualization


In this section, we will analyze different components of our dataset graphically to display the relationships present within it. For example, in the next block we look at a violin plot of the net cost of every university and how it changes each year. Since the purpose of a violin plot is to visualize a distribution of data, we can see the increasing volatility in the net cost of university as time goes on.

In [14]:
fig = go.Figure()
fig.add_trace(
    go.Violin(
        x=scard_df['year'], 
        y=scard_df['net_cost'],
        box_visible=True,
        meanline_visible=True
    )
)
fig.show()

We can observe that net cost appears to follow a bimodal distribution, implying multiple underlying factors with significant variance. Let's keep exploring the data and see if this holds for other costs.

In [15]:
fig = go.Figure()
fig.add_trace(
    go.Violin(
        x=scard_df['year'], 
        y=scard_df['in_state_tuition'],
        box_visible=True,
        meanline_visible=True
    )
)
fig.show()
In [16]:
fig = go.Figure()
fig.add_trace(
    go.Violin(
        x=scard_df['year'], 
        y=scard_df['out_state_tuition'],
        box_visible=True,
        meanline_visible=True
    )
)
fig.show()

It seems as if this bimodal distribution exists for all of our tuition cost data. Let's look at whether public or private institutions behave differently.

In [17]:
fig = px.scatter(scard_df, x="year", y="in_state_tuition", facet_col="ownership")
fig.show()
In [18]:
fig = px.scatter(scard_df, x="year", y="in_tuition_adjusted", facet_col="ownership")
fig.show()
In [19]:
fig = px.scatter(scard_df, x="year", y="out_tuition_adjusted", facet_col="ownership")
fig.show()

Aha! There does appear to be a relationship between ownership (public vs. private) and net costs, as the two groups visibly vary differently. We can explore this in our analysis later. Next, let's look at our census data. MSA groupings are major geographical units for the census, and are useful to compare against smaller units.

In [20]:
fig = make_subplots(rows=1, cols=2)
fig.add_trace(
    go.Violin(x=msa_df["year"], 
              y=msa_df["msa_income"], 
              box_visible=True,
              meanline_visible=True
             ),
    row=1, col=1
)
fig.add_trace(
    go.Violin(x=msa_df["year"], 
              y=msa_df["msa_rent"], 
              box_visible=True,
              meanline_visible=True
             ),
    row=1, col=2
)

fig.update_xaxes(title_text="Year", row=1, col=1)
fig.update_xaxes(title_text="Year", row=1, col=2)
fig.update_yaxes(title_text="Median Household Income", row=1, col=1)
fig.update_yaxes(title_text="Median Monthly Rent ", row=1, col=2)
fig.update_layout(title_text="Household Income vs. Monthly Rent Over Time for MSA Adjusted for Inflation")
fig.show()

It seems that rent and income across our MSAs are mostly normally distributed within each year, and vary over time. Let's do the same with PUMAs, smaller geographical units for the census that are more representative of the immediate area around our colleges.

In [21]:
fig = make_subplots(rows=1, cols=2)
fig.add_trace(
    go.Violin(x=puma_df["year"], 
              y=puma_df["puma_income"], 
              box_visible=True,
              meanline_visible=True
             ),
    row=1, col=1
)
fig.add_trace(
    go.Violin(x=puma_df["year"], 
              y=puma_df["puma_rent"], 
              box_visible=True,
              meanline_visible=True
             ),
    row=1, col=2
)

fig.update_xaxes(title_text="Year", row=1, col=1)
fig.update_xaxes(title_text="Year", row=1, col=2)
fig.update_yaxes(title_text="Median Household Income", row=1, col=1)
fig.update_yaxes(title_text="Median Monthly Rent", row=1, col=2)
fig.update_layout(title_text="Household Income vs. Monthly Rent Over Time for PUMA")
fig.show()

The same holds true for PUMAs! However, it's possible that inflation accounts for our yearly variance; let's normalize using an all-items CPI and try again.

In [22]:
msa_df["msa_income_adj"] = msa_df.apply(lambda row: 
            (row["msa_income"]/cpi_df.at[row["year"]]) * 100,
        axis=1
)
msa_df["msa_rent_adj"] = msa_df.apply(lambda row: 
            (row["msa_rent"]/cpi_df.at[row["year"]]) * 100,
        axis=1
)
scard_df["msa_income_adj"] = scard_df.apply(lambda row: 
            (row["msa_income"]/cpi_df.at[row["year"]]) * 100,
        axis=1
)
scard_df["msa_rent_adj"] = scard_df.apply(lambda row: 
            (row["msa_rent"]/cpi_df.at[row["year"]]) * 100,
        axis=1
)
scard_df["msa_rent_adj_ytd"] = scard_df.apply(lambda row: 
            (row["msa_rent"]/cpi_df.at[row["year"]]) * 100 * 12,
        axis=1
)


fig = make_subplots(rows=1, cols=2)
fig.add_trace(
    go.Violin(x=msa_df["year"], 
              y=msa_df["msa_income_adj"], 
              box_visible=True,
              meanline_visible=True
             ),
    row=1, col=1
)
fig.add_trace(
    go.Violin(x=msa_df["year"], 
              y=msa_df["msa_rent_adj"], 
              box_visible=True,
              meanline_visible=True
             ),
    row=1, col=2
)

fig.update_xaxes(title_text="Year", row=1, col=1)
fig.update_xaxes(title_text="Year", row=1, col=2)
fig.update_yaxes(title_text="Median Household Income", row=1, col=1)
fig.update_yaxes(title_text="Median Monthly Rent ", row=1, col=2)
fig.update_layout(title_text="Household Income vs. Monthly Rent Over Time for MSA Adjusted for Inflation")
fig.show()
In [23]:
puma_df["puma_income_adj"] = puma_df.apply(lambda row: 
            (row["puma_income"]/cpi_df.at[row["year"]]) * 100,
        axis=1
)
puma_df["puma_rent_adj"] = puma_df.apply(lambda row: 
            (row["puma_rent"]/cpi_df.at[row["year"]]) * 100,
        axis=1
)
scard_df["puma_income_adj"] = scard_df.apply(lambda row: 
            (row["puma_income"]/cpi_df.at[row["year"]]) * 100,
        axis=1
)
scard_df["puma_rent_adj"] = scard_df.apply(lambda row: 
            (row["puma_rent"]/cpi_df.at[row["year"]]) * 100,
        axis=1
)
scard_df["puma_rent_adj_ytd"] = scard_df.apply(lambda row: 
            (row["puma_rent"]/cpi_df.at[row["year"]]) * 100 * 12,
        axis=1
)

fig = make_subplots(rows=1, cols=2)
fig.add_trace(
    go.Violin(x=puma_df["year"], 
              y=puma_df["puma_income_adj"], 
              box_visible=True,
              meanline_visible=True
             ),
    row=1, col=1
)
fig.add_trace(
    go.Violin(x=puma_df["year"], 
              y=puma_df["puma_rent_adj"], 
              box_visible=True,
              meanline_visible=True
             ),
    row=1, col=2
)

fig.update_xaxes(title_text="Year", row=1, col=1)
fig.update_xaxes(title_text="Year", row=1, col=2)
fig.update_yaxes(title_text="Median Household Income", row=1, col=1)
fig.update_yaxes(title_text="Median Monthly Rent", row=1, col=2)
fig.update_layout(title_text="Household Income vs. Monthly Rent Over Time for PUMA Adjusted for Inflation")
fig.show()
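As an aside, the row-wise `apply` used for these adjustments is correct but slow on larger frames; the same arithmetic can be vectorized with `Series.map`. A minimal sketch, with an illustrative two-year CPI series standing in for `cpi_df` (the values and column names here are assumptions for demonstration only):

```python
import pandas as pd

# Illustrative stand-ins: a CPI series indexed by year, and a small
# frame with the columns the adjustment needs.
cpi = pd.Series({2009: 214.5, 2019: 255.7})
df = pd.DataFrame({"year": [2009, 2019], "rent": [800.0, 1100.0]})

# Vectorized equivalent of the row-wise apply above: divide each rent
# by the CPI for its year, then rescale to index-100 dollars.
df["rent_adj"] = df["rent"] / df["year"].map(cpi) * 100
print(df["rent_adj"].round(2).tolist())
```

`map` looks up each row's year in the CPI series in one pass, avoiding a Python-level loop over rows.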

There still seems to be some fluctuation over time, but it looks significantly less important now. What about the relationship between rent and nearby university costs? Let's plot the relevant data and break it down by ownership of nearby institutions to see if the areas around public and private institutions behave differently.

In [24]:
fig = px.scatter(scard_df, x="net_cost_adjusted", y="msa_rent_adj_ytd", trendline="ols", facet_col="ownership")
fig.show()

The trend lines differ, so the ownership relationship appears to carry over to rent data as well. Let's see whether the distribution of rents across our PUMAs is normal for public and private institutions.

In [25]:
fig = px.violin(scard_df, x="ownership", y="puma_rent_adj_ytd")
fig.show()

It seems as if some other factor is flattening the distribution for PUMAs surrounding private institutions. Let's look at locality and the density of the surrounding area to see if there is a relationship there.

In [26]:
fig = px.violin(scard_df, x="locale", y="puma_rent_adj_ytd")
fig.show()

While rents across most localities are unimodal, rural and remote towns are bimodal, and larger localities are slightly vertically skewed.


Analysis, Hypothesis Testing, and Machine Learning


Since monthly rent appears to increase roughly linearly with household income, I will use a least squares linear regression to highlight this relationship.

In [27]:
fig = px.scatter(msa_df, x="msa_rent_adj", y="msa_income_adj", trendline="ols")
fig.show()
In [28]:
results = px.get_trendline_results(fig)
print(results.px_fit_results.iloc[0].summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.810
Model:                            OLS   Adj. R-squared:                  0.810
Method:                 Least Squares   F-statistic:                     2797.
Date:                Fri, 19 May 2023   Prob (F-statistic):          1.32e-238
Time:                        17:34:17   Log-Likelihood:                -6045.3
No. Observations:                 657   AIC:                         1.209e+04
Df Residuals:                     655   BIC:                         1.210e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       6634.8486    361.556     18.351      0.000    5924.901    7344.796
x1            38.8033      0.734     52.891      0.000      37.363      40.244
==============================================================================
Omnibus:                        5.072   Durbin-Watson:                   1.704
Prob(Omnibus):                  0.079   Jarque-Bera (JB):                5.041
Skew:                          -0.215   Prob(JB):                       0.0804
Kurtosis:                       3.006   Cond. No.                     1.90e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.9e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
In [29]:
fig = px.scatter(puma_df, x="puma_rent_adj", y="puma_income_adj", trendline="ols")
fig.show()
In [30]:
results = px.get_trendline_results(fig)
print(results.px_fit_results.iloc[0].summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.716
Model:                            OLS   Adj. R-squared:                  0.716
Method:                 Least Squares   F-statistic:                     8616.
Date:                Fri, 19 May 2023   Prob (F-statistic):               0.00
Time:                        17:34:17   Log-Likelihood:                -33340.
No. Observations:                3419   AIC:                         6.668e+04
Df Residuals:                    3417   BIC:                         6.670e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       5487.7969    211.505     25.946      0.000    5073.108    5902.486
x1            41.7831      0.450     92.823      0.000      40.901      42.666
==============================================================================
Omnibus:                        9.532   Durbin-Watson:                   1.979
Prob(Omnibus):                  0.009   Jarque-Bera (JB):               10.977
Skew:                          -0.060   Prob(JB):                      0.00413
Kurtosis:                       3.250   Cond. No.                     1.40e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.4e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

As the OLS regression summaries show, the MSA plot has a tighter fit, with an R-squared of 0.810 compared to PUMA's 0.716. This can be explained by PUMA having significantly more data points (PUMAs are more specific than MSAs) and wider variance among the data (for example, the monthly rent for MSAs does not even exceed 1000, but the monthly rent for PUMAs goes well above 1100). This, among other factors, suggests that universities affect pricing much more on a small scale, such as the city level, than on a county- or metropolitan-wide scale.

Multivariate Analysis¶

Let's try to combine the relationships we found between cost, ownership, and locality in our EDA into a model that describes the relationship between our colleges and the Public Use Microdata Areas and MSAs surrounding them. To do this, we will build several models representing different possible relationships, and then fit those models to our data. The next cell generates and fits our models.

In [31]:
puma_rent_model = smf.ols(formula="puma_rent_adj ~ year * ownership + in_tuition_adjusted", data=scard_df)
puma_rent_res = puma_rent_model.fit()

msa_rent_model = smf.ols(formula="msa_rent_adj ~ year * ownership + in_tuition_adjusted", data=scard_df)
msa_rent_res = msa_rent_model.fit()

prm2 = smf.ols(formula="puma_rent_adj ~ year * ownership", data=scard_df)
prm2_res = prm2.fit()

prm3 = smf.ols(formula="puma_rent_adj ~ year * ownership + locale", data=scard_df)
prm3_res = prm3.fit()

num = len(scard_df["year"].values)
df_samples = scard_df[["year", "ownership", "in_tuition_adjusted"]]
df_samples = sm.add_constant(df_samples)
sample_set = np.column_stack(
    (
        np.ones(num), 
        scard_df["year"].values, 
        scard_df["ownership"].values,
        scard_df["in_tuition_adjusted"].values,
    )
)

df_samples2 = scard_df[["year", "ownership"]]
df_samples2 = sm.add_constant(df_samples2)
sample_set2 = np.column_stack(
    (
        np.ones(num), 
        scard_df["year"].values, 
        scard_df["ownership"].values,
    )
)
scard_df["puma_rent_fit"] = puma_rent_res.predict(df_samples)
scard_df["puma_rent_fit2"] = prm2_res.predict(df_samples2)
scard_df["msa_rent_fit"] = msa_rent_res.predict(df_samples)

Let's plot the first model we created, which assumes a relationship between year, cost, and ownership in determining rents.

In [32]:
fig = make_subplots(rows=2, cols=1)

fig.update_layout(height=800, width=800, title_text="Interaction Model Public Use MicroData Area Rent Distribution and Trend Lines")
for i,id in enumerate (scard_df["ownership"].unique()):
    sub_df = scard_df[scard_df["ownership"] == id].copy().sort_values(by="year")

    trend_df = sub_df.groupby("year")["puma_rent_fit"].mean()
    fig.add_trace(
        go.Violin(
            x=sub_df['year'], 
            y=sub_df['puma_rent_adj'],
            name=id,
        ),
        row=(i+1), col=1
    )
    fig.add_trace(
        go.Scatter(
            x=trend_df.index, 
            y=trend_df,
            name=id,
        ),
        row=(i+1), col=1
    )
fig.show()

The variance by year is negligible, but it seems that rents in public-university communities are trending downward while those in private ones trend upward. Let's look at our MSA data as well to see if this is reflected in broader geographical regions.

In [33]:
fig = make_subplots(rows=2, cols=1)

fig.update_layout(height=800, width=800, title_text="Interaction Model Metropolitan Statistical Area Rent Distribution and Trend Lines")
for i,id in enumerate (scard_df["ownership"].unique()):
    sub_df = scard_df[scard_df["ownership"] == id].copy().sort_values(by="year")

    trend_df = sub_df.groupby("year")["msa_rent_fit"].mean()
    fig.add_trace(
        go.Violin(
            x=sub_df['year'], 
            y=sub_df['msa_rent_adj'],
            name=id,
        ),
        row=(i+1), col=1
    )
    fig.add_trace(
        go.Scatter(
            x=trend_df.index, 
            y=trend_df,
            name=id,
        ),
        row=(i+1), col=1
    )
fig.show()

Interestingly, the prior relationship is inverted, and the private data appears somewhat bimodal, implying we are missing a variable. From here on, let's analyze our OLS summaries and p-values for significance.

In [34]:
print(puma_rent_res.summary())
print(puma_rent_res.pvalues)
                            OLS Regression Results                            
==============================================================================
Dep. Variable:          puma_rent_adj   R-squared:                       0.003
Model:                            OLS   Adj. R-squared:                  0.002
Method:                 Least Squares   F-statistic:                     3.598
Date:                Fri, 19 May 2023   Prob (F-statistic):            0.00619
Time:                        17:34:17   Log-Likelihood:                -31199.
No. Observations:                4777   AIC:                         6.241e+04
Df Residuals:                    4772   BIC:                         6.244e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
============================================================================================
                               coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------
Intercept                  678.5107   2084.476      0.326      0.745   -3408.024    4765.046
ownership[T.Public]       4648.0332   3175.094      1.464      0.143   -1576.615    1.09e+04
year                        -0.1306      1.038     -0.126      0.900      -2.165       1.904
year:ownership[T.Public]    -2.2918      1.577     -1.453      0.146      -5.384       0.801
in_tuition_adjusted          0.0013      0.001      1.402      0.161      -0.001       0.003
==============================================================================
Omnibus:                     1126.099   Durbin-Watson:                   1.669
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             2423.724
Skew:                           1.361   Prob(JB):                         0.00
Kurtosis:                       5.183   Cond. No.                     2.15e+07
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.15e+07. This might indicate that there are
strong multicollinearity or other numerical problems.
Intercept                   0.744812
ownership[T.Public]         0.143286
year                        0.899891
year:ownership[T.Public]    0.146330
in_tuition_adjusted         0.160856
dtype: float64

Our first model, analyzing cost, year, and ownership, has a good F-statistic (p ≈ 0.006), implying that the model as a whole is significant; however, the p-values of our individual variables do not reach the p < 0.05 significance level. Let's try a different model, this time without tuition costs.

In [35]:
print(prm2_res.summary())
print(prm2_res.pvalues)
                            OLS Regression Results                            
==============================================================================
Dep. Variable:          puma_rent_adj   R-squared:                       0.003
Model:                            OLS   Adj. R-squared:                  0.002
Method:                 Least Squares   F-statistic:                     4.141
Date:                Fri, 19 May 2023   Prob (F-statistic):            0.00610
Time:                        17:34:17   Log-Likelihood:                -31200.
No. Observations:                4777   AIC:                         6.241e+04
Df Residuals:                    4773   BIC:                         6.243e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
============================================================================================
                               coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------
Intercept                 -276.5553   1970.291     -0.140      0.888   -4139.234    3586.123
ownership[T.Public]       5307.0927   3140.441      1.690      0.091    -849.619    1.15e+04
year                         0.3556      0.978      0.364      0.716      -1.562       2.273
year:ownership[T.Public]    -2.6279      1.559     -1.685      0.092      -5.685       0.429
==============================================================================
Omnibus:                     1120.860   Durbin-Watson:                   1.669
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             2407.705
Skew:                           1.356   Prob(JB):                         0.00
Kurtosis:                       5.178   Cond. No.                     3.14e+06
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.14e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
Intercept                   0.888379
ownership[T.Public]         0.091109
year                        0.716213
year:ownership[T.Public]    0.091998
dtype: float64

This model with only year and ownership yields a slightly better F-statistic, and so is slightly more predictive, and its variables' p-values are closer to 0.05; however, they still do not reach clear statistical significance. This does seem to be the right track, though. Let's see what happens when we incorporate locale into our model.

In [36]:
print(prm3_res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:          puma_rent_adj   R-squared:                       0.076
Model:                            OLS   Adj. R-squared:                  0.074
Method:                 Least Squares   F-statistic:                     43.48
Date:                Fri, 19 May 2023   Prob (F-statistic):           1.50e-75
Time:                        17:34:17   Log-Likelihood:                -31017.
No. Observations:                4777   AIC:                         6.205e+04
Df Residuals:                    4767   BIC:                         6.212e+04
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
=============================================================================================
                                coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------
Intercept                  1367.7492   1900.237      0.720      0.472   -2357.592    5093.090
ownership[T.Public]        6119.0853   3030.270      2.019      0.044     178.357    1.21e+04
locale[T.City: Midsize]     -15.0368      7.507     -2.003      0.045     -29.755      -0.319
locale[T.City: Small]       -87.6213      6.972    -12.568      0.000    -101.289     -73.954
locale[T.Rural: Fringe]      14.5834     37.292      0.391      0.696     -58.526      87.693
locale[T.Suburb: Large]      -4.5794      8.502     -0.539      0.590     -21.247      12.088
locale[T.Suburb: Midsize]   -79.0237      7.582    -10.423      0.000     -93.887     -64.160
locale[T.Town: Remote]     -155.8319     11.912    -13.082      0.000    -179.186    -132.478
year                         -0.4390      0.943     -0.465      0.642      -2.289       1.410
year:ownership[T.Public]     -3.0348      1.505     -2.017      0.044      -5.985      -0.085
==============================================================================
Omnibus:                     1146.717   Durbin-Watson:                   1.790
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             2518.813
Skew:                           1.375   Prob(JB):                         0.00
Kurtosis:                       5.258   Cond. No.                     3.14e+06
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.14e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
In [37]:
print(prm3_res.pvalues)
Intercept                    4.716967e-01
ownership[T.Public]          4.350981e-02
locale[T.City: Midsize]      4.523786e-02
locale[T.City: Small]        1.154238e-35
locale[T.Rural: Fringe]      6.957702e-01
locale[T.Suburb: Large]      5.901630e-01
locale[T.Suburb: Midsize]    3.624453e-25
locale[T.Town: Remote]       1.913221e-38
year                         6.416813e-01
year:ownership[T.Public]     4.375242e-02
dtype: float64

As the OLS regression summary shows, this model incorporating locale, ownership, and year fits much better, with a near-zero p-value on its F-statistic and almost all of its variables crossing the threshold of statistical significance. This suggests that the interaction between rents in the region surrounding a college and the characteristics of that college and region varies strongly with the university's locality and ownership. This makes sense for locality: a university in a major city might increase rent, but will have a less drastic overall effect, and a major city is also likely to host more universities, leading to a smaller coefficient. On the other hand, a university setting up shop in a small town would much more significantly increase housing demand and value in the area, leading to a larger coefficient.
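Reading the p-values table by eye, as above, can also be mechanized: a fitted results object's `pvalues` is a pandas Series, so the significant terms can be filtered out in one line. A minimal sketch, with a toy fit standing in for `prm3_res` (the data and column names here are invented):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy fit: y depends strongly on x, so x should come out significant.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=200)})
df["y"] = 3.0 * df["x"] + rng.normal(size=200)
res = smf.ols("y ~ x", data=df).fit()

# Keep only the terms below the conventional 0.05 significance threshold.
significant = res.pvalues[res.pvalues < 0.05]
print(significant.index.tolist())
```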


Insight and Policy Decision


Based on our observations throughout our project, we can conclude:

  1. There is a correlation between median income and median housing costs.
  2. As time continues, both median income and median housing costs will likely continue rising due to inflation.
  3. Localities near universities are more volatile and often significantly more expensive than housing farther out, creating a heavily gentrified environment (we can see an example of this in College Park).

If you would like to learn more about the tools we used to make this:

  • Census API
  • Andrew G. Reiter, “U.S. News & World Report Historical Liberal Arts College and University Rankings,” available at: http://andyreiter.com/datasets/
  • College Scorecard API
  • Zip To Tract API/Excel Doc
  • Department of Labor Beta Labs Historical CPI Data